Word-level Deciphering Algorithm for Degraded Document Recognition

Author

  • Jonathan J. Hull
Abstract

A text recognition algorithm is proposed that uses word-level language constraints in a deciphering framework to directly decode the identity of each word pattern in an input text. This is a font-independent approach that solves the problems of touching characters and character fragmentation. The major difficulty of using a deciphering approach at the word level is that the relatively stable and reliable language constraints available at the character level, such as character n-grams and a vocabulary of common words, usually do not scale up to the word level. The approach presented in this paper solves a selected portion of an input text using a word-level relaxation deciphering algorithm. Font information is then learned from the solved portion of the text and used to re-recognize the rest of the text. Tests of the proposed approach on both artificially generated and scanned documents show satisfactory performance in the presence of image degradation.
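The relaxation step mentioned in the abstract can be illustrated with a small sketch. This is not the paper's algorithm, which the abstract does not specify; it is a generic relaxation-labeling loop under stated assumptions: word images have already been clustered into integer pattern IDs, each cluster has a hypothetical candidate word list with prior probabilities, and a hypothetical word-bigram table supplies the word-level language constraint. The names `relax_decipher`, `candidates`, and `bigram` are illustrative only. Font learning and re-recognition of the remaining text are not covered.

```python
from collections import defaultdict


def relax_decipher(clusters, candidates, bigram, iterations=10, floor=1e-3):
    """Assign one word to each word-pattern cluster by relaxation labeling.

    clusters   -- cluster IDs of the word images, in reading order
    candidates -- {cluster ID: {word: prior probability}} (assumed input)
    bigram     -- {(left word, right word): compatibility} (assumed input)
    floor      -- compatibility used for word pairs missing from `bigram`
    """
    probs = {c: dict(words) for c, words in candidates.items()}
    for _ in range(iterations):
        # Accumulate contextual support for every (cluster, word) label
        # from the current label probabilities of adjacent word patterns.
        support = {c: defaultdict(float) for c in probs}
        for left, right in zip(clusters, clusters[1:]):
            for wl, pl in probs[left].items():
                for wr, pr in probs[right].items():
                    compat = bigram.get((wl, wr), floor)
                    support[left][wl] += compat * pr
                    support[right][wr] += compat * pl
        # Reweight each label by its support and renormalize per cluster.
        updated = {}
        for c, words in probs.items():
            has_context = bool(support[c])
            scores = {
                w: p * (support[c][w] if has_context else 1.0)
                for w, p in words.items()
            }
            total = sum(scores.values())
            updated[c] = (
                {w: s / total for w, s in scores.items()} if total > 0 else words
            )
        probs = updated
    # Final decision: the most probable word for each cluster.
    return {c: max(words, key=words.get) for c, words in probs.items()}


# Toy input: five word images, where the first and fourth share a pattern.
clusters = [0, 1, 2, 0, 3]
candidates = {
    0: {"the": 0.6, "she": 0.4},
    1: {"cat": 0.5, "oat": 0.5},
    2: {"was": 0.6, "saw": 0.4},  # the prior favors the wrong word
    3: {"dog": 0.5, "dig": 0.5},
}
bigram = {
    ("the", "cat"): 1.0,
    ("cat", "saw"): 1.0,
    ("saw", "the"): 1.0,
    ("the", "dog"): 1.0,
}
decoded = relax_decipher(clusters, candidates, bigram)
print(" ".join(decoded[c] for c in clusters))
```

In this toy input the prior for cluster 2 prefers "was", but support from the neighboring patterns shifts the decision to "saw". Because every occurrence of a cluster shares one label, repeated word patterns pool evidence from all their contexts, which is the kind of constraint a deciphering approach relies on.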


Similar resources

Modified character-level deciphering algorithm for OCR in degraded documents

Modifications to a previous character-level deciphering algorithm for OCR are presented in this paper that are able to handle touching characters and are tolerant to mistakes made at the clustering stage. The objective of a character-level deciphering algorithm is to assign alphabetic identities to character patterns such that the character repetition pattern in an input text matches the letter r...


Language-Level Syntactic and Semantic Constraints Applied to Visual Word Recognition

Various aspects of using language-level syntactic and semantic constraints to improve the performance of word recognition algorithms are discussed. Following a brief presentation of a hypothesis generation model for handwritten word recognition, various types of language-level constraints are reviewed. Methods that exploit these characteristics are discussed including intra-document word correlati...


Handwriting Recognition (HR) of Family History Documents using a 2-D Warping-based Word-level HR Approach

An enormous amount of handwritten information exists that is potentially very useful for family history research. However, finding information of interest is a daunting task unless the handwriting is transcribed or indexed so that it can be digitally searched. Transcription / indexing is typically done manually because automatic handwriting recognition (HR) is not yet accurate enough to provide...


A New Document Embedding Method for News Classification

Text classification is one of the main tasks of natural language processing (NLP). In this task, documents are classified into pre-defined categories. A large amount of news spreads on the web. A text classifier can categorize news automatically, which facilitates and accelerates access to the news. The first step in text classification is to represent documents in a suitable way t...


A Probabilistic Topic Model Based on Local Word Relationships in Overlapping Windows

A probabilistic topic model assumes that documents are generated through a process involving topics and then tries to reverse this process: given the documents, it extracts the topics. A topic is usually assumed to be a distribution over words. LDA is one of the first and most popular topic models introduced so far. In the document generation process assumed by LDA, each document is a distribution o...




Publication date: 2012